If we wanted to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here
To define temperature anomalies you need to have a reference, or base, period which NASA clearly states that it is the period between 1951-1980.
Run the code below to load the file:
weather <-
read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
skip = 1,
na = "***")You have two objectives in this section:
Select the year and the twelve month variables from the weather dataset. We do not need the others (J-D, D-N, DJF, etc.) for this assignment. Hint: use select() function.
Convert the dataframe from wide to ‘long’ format. Hint: use gather() or pivot_longer() function. Name the new dataframe as tidyweather, name the variable containing the name of the month as month, and the temperature deviation values as delta.
tidyweather <- weather %>%
select (Year, Jan, Feb, Mar, Apr, May,
Jun, Jul, Aug, Sep, Oct, Nov, Dec) %>%
pivot_longer(cols = 2:13, #columns 3 to 5
names_to = "Month",
values_to = "delta")Inspect your dataframe. It should have three variables now, one each for
Let us plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called date in order to ensure that the delta values are plot chronologically.
glimpse(tidyweather)## Rows: 1,704
## Columns: 3
## $ Year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880…
## $ Month <chr> "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "…
## $ delta <dbl> -0.35, -0.50, -0.23, -0.29, -0.05, -0.15, -0.18, -0.25, -0.23, -…
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), Month, "1")),
month = month(date, label=TRUE),
year = year(date))
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point(size = 0.8)+
geom_smooth(color="red") +
theme_bw() +
labs (
title = "Weather Anomalies"
)Is the effect of increasing temperature more pronounced in some months? Use facet_wrap() to produce a seperate scatter plot for each month, again with a smoothing line. Your chart should human-readable labels; that is, each month should be labeled “Jan”, “Feb”, “Mar” (full or abbreviated month names are fine), not 1, 2, 3.
We remove data before 1800 and before using filter. Then, we use the mutate function to create a new variable interval which contains information on which period each observation belongs to. We can assign the different periods using case_when().
comparison <- tidyweather %>%
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
))
glimpse(comparison)## Rows: 1,692
## Columns: 7
## $ Year <dbl> 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1…
## $ Month <chr> "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"…
## $ delta <dbl> -0.31, -0.22, -0.04, 0.00, 0.04, -0.32, 0.08, -0.04, -0.26, -…
## $ date <date> 1881-01-01, 1881-02-01, 1881-03-01, 1881-04-01, 1881-05-01, …
## $ month <ord> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, J…
## $ year <dbl> 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1…
## $ interval <chr> "1881-1920", "1881-1920", "1881-1920", "1881-1920", "1881-192…
Inspect the comparison dataframe by clicking on it in the Environment pane.
Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by different time periods.
ggplot(comparison, aes(x=delta, fill=interval))+
geom_density(alpha=0.2) + #density plot with tranparency set to 20%
theme_bw() + #theme
labs (
title = "Density Plot for Monthly Temperature Anomalies",
y = "Density" #changing y-axis label to sentence case
)+
NULLSo far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.
#creating yearly averages
average_annual_anomaly <- tidyweather %>%
group_by(Year) %>% #grouping data by Year
# creating summaries for mean delta
# use `na.rm=TRUE` to eliminate NA (not available) values
summarise(annual_average_delta = mean(delta, na.rm=TRUE))
glimpse(average_annual_anomaly)## Rows: 142
## Columns: 2
## $ Year <dbl> 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1…
## $ annual_average_delta <dbl> -0.2808, -0.1733, -0.2083, -0.2742, -0.4217, -0.4…
#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
geom_point()+
#Fit the best fit line, using LOESS method
geom_smooth() +
#change to theme_bw() to have white background + black frame around plot
theme_bw() +
labs (
title = "Average Yearly Anomaly",
y = "Average Annual Delta"
) deltaNASA points out on their website that
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.
formula_ci <- comparison %>%
group_by(interval) %>%
summarise(mean_delta = mean(delta, na.rm=TRUE),
median_delta = median(delta, na.rm=TRUE),
sd_delta = sd(delta, na.rm=TRUE),
count_delta = n(),
se_delta = sd_delta / sqrt(count_delta),
ci_delta_up = mean_delta + qt(.975, count_delta-1)*se_delta ,
ci_delta_dw = mean_delta - qt(.975, count_delta-1)*se_delta
)
# choose the interval 2011-present
# what dplyr verb will you use?
# calculate summary statistics for temperature deviation (delta)
# calculate mean, SD, count, SE, lower/upper 95% CI
# what dplyr verb will you use?
#print out formula_CI
print(formula_ci)## # A tibble: 5 × 8
## interval mean_delta median_delta sd_delta count_delta se_delta ci_delta_up
## <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1881-1920 -0.338 -0.33 0.236 480 0.0108 -0.317
## 2 1921-1950 -0.0186 -0.025 0.233 360 0.0123 0.00546
## 3 1951-1980 -0.0000833 0.01 0.197 360 0.0104 0.0204
## 4 1981-2010 0.468 0.475 0.315 360 0.0166 0.500
## 5 2011-present 1.06 1.04 0.276 132 0.0240 1.11
## # … with 1 more variable: ci_delta_dw <dbl>
# use the infer package to construct a 95% CI for delta
set.seed(1)
#Calculate the CI for the relevant interval...
#2011 - present
boot_weather_1 <- comparison %>%
filter(interval == "2011-present") %>%
specify(response = delta) %>%
generate(reps=100, type="bootstrap") %>%
calculate(stat = "mean")
percentile_ci_1<- boot_weather_1 %>%
get_confidence_interval(level = 0.95, type = "percentile")
print(percentile_ci_1)## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.02 1.11
What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!
As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval
# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv')
glimpse(approval_polllist)## Rows: 1,597
## Columns: 22
## $ president <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate <chr> "1/24/2021", "1/24/2021", "1/25/2021", "1/25/2021"…
## $ enddate <chr> "1/26/2021", "1/27/2021", "1/27/2021", "1/26/2021"…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate <chr> "1/27/2021", "2/1/2021", "1/28/2021", "2/5/2021", …
## $ timestamp <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
# Use `lubridate` to fix dates, as they are given as characters.What I would like you to do is to calculate the average net approval rate (approve- disapprove) for each week since he got into office. I want you plot the net approval, along with its 95% confidence interval. There are various dates given for each poll, please use enddate, i.e., the date the poll ended.
Also, please add an orange line at zero. Your plot should look like this:
glimpse(approval_polllist)## Rows: 1,597
## Columns: 22
## $ president <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate <chr> "1/24/2021", "1/24/2021", "1/25/2021", "1/25/2021"…
## $ enddate <chr> "1/26/2021", "1/27/2021", "1/27/2021", "1/26/2021"…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate <chr> "1/27/2021", "2/1/2021", "1/28/2021", "2/5/2021", …
## $ timestamp <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
#prepare the dataset
#calculate the margin
approval_polllist$enddate <- mdy(approval_polllist$enddate)
new_approval <- approval_polllist %>%
filter(subgroup=="Voters") %>%
mutate(net_approve = approve - disapprove, year = year(enddate), week = week(enddate))
glimpse(new_approval)## Rows: 380
## Columns: 25
## $ president <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup <chr> "Voters", "Voters", "Voters", "Voters", "Voters", …
## $ modeldate <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate <chr> "1/24/2021", "1/25/2021", "1/24/2021", "1/26/2021"…
## $ enddate <date> 2021-01-26, 2021-01-27, 2021-01-27, 2021-01-28, 2…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Rasmu…
## $ grade <chr> "B", "B", "A", "B", "A+", "B-", "B/C", "B+", "B", …
## $ samplesize <dbl> 1500, 1500, 1153, 1500, 1002, 1200, 841, 945, 1500…
## $ population <chr> "lv", "lv", "rv", "lv", "rv", "rv", "lv", "rv", "l…
## $ weight <dbl> 0.322, 0.303, 2.003, 0.286, 1.945, 0.881, 1.115, 1…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 49.0, 50.0, 50.0, 58.0, 58.0, 56.1, 61.0, 49…
## $ disapprove <dbl> 48.0, 48.0, 36.0, 45.0, 29.0, 35.0, 36.4, 39.0, 46…
## $ adjusted_approve <dbl> 50.5, 51.5, 50.9, 52.5, 56.2, 57.2, 54.8, 56.8, 51…
## $ adjusted_disapprove <dbl> 42.6, 42.6, 35.5, 39.6, 32.7, 36.5, 36.7, 40.4, 40…
## $ multiversions <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking <lgl> TRUE, TRUE, NA, TRUE, NA, NA, NA, NA, TRUE, TRUE, …
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74261, 74268, 74320, 74290, 74321, 74355, 74326, 7…
## $ question_id <dbl> 139433, 139483, 139559, 139515, 139561, 139680, 13…
## $ createddate <chr> "1/27/2021", "1/28/2021", "2/1/2021", "1/29/2021",…
## $ timestamp <chr> "18:35:11 10 Sep 2021", "18:35:11 10 Sep 2021", "1…
## $ net_approve <dbl> 0.0, 1.0, 14.0, 5.0, 29.0, 23.0, 19.7, 22.0, 3.0, …
## $ year <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 20…
## $ week <dbl> 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
#calculate the CI for average approval margin
margin_ci <- new_approval %>%
group_by(week) %>%
summarise(avg_approval_margin = mean(net_approve),
sd_margin = sd(net_approve, na.rm=TRUE),
count_margin = n(),
se_margin = sd_margin / sqrt(count_margin),
ci_margin_up = avg_approval_margin + qt(.975, count_margin-1)*se_margin ,
ci_margin_dw = avg_approval_margin - qt(.975, count_margin-1)*se_margin
)
glimpse(margin_ci)## Rows: 33
## Columns: 7
## $ week <dbl> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ avg_approval_margin <dbl> 5.00, 13.84, 11.89, 12.85, 11.67, 9.29, 8.25, 11.5…
## $ sd_margin <dbl> 6.38, 8.95, 9.78, 8.40, 9.49, 7.94, 6.94, 8.14, 10…
## $ count_margin <int> 4, 14, 9, 13, 13, 14, 12, 11, 15, 10, 8, 15, 14, 1…
## $ se_margin <dbl> 3.19, 2.39, 3.26, 2.33, 2.63, 2.12, 2.00, 2.45, 2.…
## $ ci_margin_up <dbl> 15.15, 19.01, 19.40, 17.93, 17.40, 13.87, 12.66, 1…
## $ ci_margin_dw <dbl> -5.1473, 8.6662, 4.3728, 7.7672, 5.9303, 4.7033, 3…
#plot the for average approval margin
margin_ci %>%
ggplot(aes(week)) +
geom_ribbon(aes(ymin = ci_margin_up, ymax = ci_margin_dw), fill="grey", alpha = 0.5)+
geom_line(aes(y=avg_approval_margin, group=1), color = "red", size = 0.3) +
geom_point(aes(y=avg_approval_margin, group=1), color = "red", size = 0.3) +
geom_smooth(aes(y=avg_approval_margin)) +
geom_line(aes(y=ci_margin_up, group=1), color="red", size = 0.3)+
geom_line(aes(y=ci_margin_dw, group=1), color="red", size = 0.3)+
#add the aesthetics for the graph
geom_hline(yintercept=0, linetype="solid", color = "orange", size = 1)+
labs(title="Estimating Approval Margin (approve-disapprove) for Joe Biden",
subtitle = "Weekly average of all polls",
x = "Week of the year",
y = "Average Approval Margin (Approve - Disapprove") +
annotate("text", x = 20, y = 20, label = "2021", color = 'black',size = 3)+
scale_y_continuous(
labels = scales::number_format(accuracy = 0.1))+
theme_minimal()+
NULLCompare the confidence intervals for week 3 and week 25. Can you explain what’s going on? One paragraph would be enough.
#calculate the week4 and week5 confidence intervals
margin_ci_comparison <- new_approval %>%
filter(week==4 | week==25) %>%
group_by(week) %>%
summarise(avg_approval_margin = mean(net_approve),
sd_margin = sd(net_approve, na.rm=TRUE),
count_margin = n(),
se_margin = sd_margin / sqrt(count_margin),
ci_margin_up = avg_approval_margin + qt(.975, count_margin-1)*se_margin ,
ci_margin_dw = avg_approval_margin - qt(.975, count_margin-1)*se_margin
)
#plot the confidence intervals
margin_ci_comparison %>%
ggplot(aes(x=avg_approval_margin, y=week, color=week))+
geom_errorbarh(aes(xmin=ci_margin_dw,xmax=ci_margin_up))+
geom_point(aes(x=avg_approval_margin, y=week), size=6)+
labs(title="Comparison of confidence intervals for week4 and week 25",
x = "Average approval margin",
y = "Week")#We can see that the average approval margin for week4 is larger than that in week25. However, the confidence interval in week4 [6.12, 19.4] and week25 [6.36, 11.5] overlap. Therefore, based on the confidence interval graphs generated below, we cannot tell if on average, Biden has significantly lower approval margin in week25 than that in week4.comparison_approval <- new_approval %>%
filter(week==4 | week==25)
t.test(net_approve ~ week, data = comparison_approval)##
## Welch Two Sample t-test
##
## data: net_approve by week
## t = -1, df = 4, p-value = 0.3
## alternative hypothesis: true difference in means between group 4 and group 25 is not equal to 0
## 95 percent confidence interval:
## -13.51 5.61
## sample estimates:
## mean in group 4 mean in group 25
## 5.00 8.95
#Based on the result below, we cannot reject the null hypothesis. Recall the gapminder data frame from the gapminder package. That data frame contains just six columns from the larger data in Gapminder World. In this part, you will join a few dataframes with more data than the ‘gapminder’ package. Specifically, you will look at data on
You must use the wbstats package to download data from the World Bank. The relevant World Bank indicators are SP.DYN.TFRT.IN, SE.PRM.NENR, NY.GDP.PCAP.KD, and SH.DYN.MORT
library(gapminder)
skim(gapminder)| Name | gapminder |
| Number of rows | 1704 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| country | 0 | 1 | FALSE | 142 | Afg: 12, Alb: 12, Alg: 12, Ang: 12 |
| continent | 0 | 1 | FALSE | 5 | Afr: 624, Asi: 396, Eur: 360, Ame: 300 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 1.98e+03 | 1.73e+01 | 1952.0 | 1.97e+03 | 1.98e+03 | 1.99e+03 | 2.01e+03 | ▇▅▅▅▇ |
| lifeExp | 0 | 1 | 5.95e+01 | 1.29e+01 | 23.6 | 4.82e+01 | 6.07e+01 | 7.08e+01 | 8.26e+01 | ▁▆▇▇▇ |
| pop | 0 | 1 | 2.96e+07 | 1.06e+08 | 60011.0 | 2.79e+06 | 7.02e+06 | 1.96e+07 | 1.32e+09 | ▇▁▁▁▁ |
| gdpPercap | 0 | 1 | 7.22e+03 | 9.86e+03 | 241.2 | 1.20e+03 | 3.53e+03 | 9.33e+03 | 1.14e+05 | ▇▁▁▁▁ |
# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))
# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")
library(wbstats)
worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
indicator = indicators,
start_date = 1960,
end_date = 2016)
# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels, from the World Bank API
countries <- wbstats::wb_cachelist$countriesYou have to join the 3 dataframes (life_expectancy, worldbank_data, and HIV) into one. You may need to tidy your data first and then perform join operations. Think about what type makes the most sense and explain why you chose it. * Left Join makes the most sense among the different types of join operations like outer joins - left, right, and full. This is because a left join operation (regardless of there being a match) preserves the original observations, especially when one looks up additional data from another table. Since, we are working with 3 dataframes, while mapping the year and date column with different start/end time frames, it is essential to preserve the original observations in each dataframe.
# tidying HIV data - hiv and life_expectancy dataframes - using pivot_longer() + remving NA values
hiv1 <- hiv %>%
pivot_longer(2:34, names_to = "year", values_to = "Percentage_HIV_Cases_Age_15_49") %>%
drop_na()
skim(hiv1)| Name | hiv1 |
| Number of rows | 3301 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 24 | 0 | 154 | 0 |
| year | 0 | 1 | 4 | 4 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Percentage_HIV_Cases_Age_15_49 | 0 | 1 | 1.74 | 4.09 | 0.01 | 0.1 | 0.3 | 1.2 | 26.5 | ▇▁▁▁▁ |
life_expectancy_1 <- life_expectancy %>%
pivot_longer(2:302, names_to = "year", values_to = "Life_Expectancy") %>%
drop_na()
skim(life_expectancy_1)| Name | life_expectancy_1 |
| Number of rows | 55528 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 30 | 0 | 187 | 0 |
| year | 0 | 1 | 4 | 4 | 0 | 301 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Life_Expectancy | 0 | 1 | 53 | 21.7 | 1.01 | 32.3 | 48.7 | 74.2 | 94.8 | ▁▇▂▅▅ |
#Removing NA values in worldbank_data
worldbank_data_1 <- worldbank_data %>%
drop_na()
skim(worldbank_data_1)| Name | worldbank_data_1 |
| Number of rows | 3715 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| iso2c | 0 | 1 | 2 | 2 | 0 | 177 | 0 |
| iso3c | 0 | 1 | 3 | 3 | 0 | 177 | 0 |
| country | 0 | 1 | 4 | 30 | 0 | 177 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| date | 0 | 1 | 1998.57 | 12.65 | 1970.00 | 1989.00 | 2001.0 | 2009.00 | 2.02e+03 | ▂▃▃▆▇ |
| NY.GDP.PCAP.KD | 0 | 1 | 12893.37 | 18110.77 | 164.46 | 1514.19 | 4313.9 | 17178.37 | 1.16e+05 | ▇▁▁▁▁ |
| SE.PRM.NENR | 0 | 1 | 85.08 | 17.09 | 10.05 | 81.05 | 91.7 | 96.43 | 1.00e+02 | ▁▁▁▂▇ |
| SH.DYN.MORT | 0 | 1 | 52.96 | 59.80 | 2.20 | 10.90 | 26.7 | 76.35 | 3.70e+02 | ▇▂▁▁▁ |
| SP.DYN.TFRT.IN | 0 | 1 | 3.35 | 1.82 | 1.08 | 1.83 | 2.7 | 4.82 | 8.45e+00 | ▇▃▂▂▁ |
#Left Join of life_expectancy_1 (tidied life_expectancy) and hiv1 (tidied hiv) dataframes
join_1 <- left_join(life_expectancy_1, hiv1, by = c("country"="country", "year"="year"))
# Converting datatype of column (year) from character type to numeric type
join_1$year = as.numeric(join_1$year)#Left Join of join_1 (left join of hiv1 and life_expectancy_1 dataframes) and worldbank_data dataframes
join_2 <- left_join(join_1, worldbank_data_1, by = c("country"="country", "year"="date"))#Left Join of join_1 (left join of hiv1 and life_expectancy_1 dataframes) and worldbank_data dataframes
join_3 <- left_join(join_2, countries, "country"="country") %>%
drop_na()
join_3## # A tibble: 1,083 × 25
## country year Life_Expectancy Percentage_HIV_Case… iso2c iso3c NY.GDP.PCAP.KD
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 Algeria 1990 71.7 0.06 DZ DZA 3572.
## 2 Algeria 1991 72.2 0.06 DZ DZA 3444.
## 3 Algeria 1992 72.5 0.06 DZ DZA 3424.
## 4 Algeria 1993 72.7 0.06 DZ DZA 3279.
## 5 Algeria 1994 72.8 0.06 DZ DZA 3183.
## 6 Algeria 1995 72.9 0.06 DZ DZA 3241.
## 7 Algeria 1996 73.3 0.06 DZ DZA 3315.
## 8 Algeria 1997 73.2 0.06 DZ DZA 3298.
## 9 Algeria 1999 73.9 0.06 DZ DZA 3474.
## 10 Algeria 2000 74 0.06 DZ DZA 3558.
## # … with 1,073 more rows, and 18 more variables: SE.PRM.NENR <dbl>,
## # SH.DYN.MORT <dbl>, SP.DYN.TFRT.IN <dbl>, capital_city <chr>,
## # longitude <dbl>, latitude <dbl>, region_iso3c <chr>, region_iso2c <chr>,
## # region <chr>, admin_region_iso3c <chr>, admin_region_iso2c <chr>,
## # admin_region <chr>, income_level_iso3c <chr>, income_level_iso2c <chr>,
## # income_level <chr>, lending_type_iso3c <chr>, lending_type_iso2c <chr>,
## # lending_type <chr>
#Scatterplot for Life Expectancy vs. HIV prevalence
ggplot(join_3, aes(x = Percentage_HIV_Cases_Age_15_49, y = Life_Expectancy)) +
geom_point(size = 0.3) +
geom_smooth(method="lm") +
facet_wrap(~ region) +
labs(title = "Relationship between HIV Prevalence and Life Expectancy",
x = "HIV Prevalence",
y = "Life Expectancy")ggplot(join_3, aes(x = SP.DYN.TFRT.IN, y = NY.GDP.PCAP.KD)) +
geom_point(size = 0.5) +
geom_smooth(method="lm") +
facet_wrap(~ region) +
labs(title = "Relationship between Fertility Rate and GDP per capita",
x = "Fertility Rate",
y = "GDP per capita")+
NULLgeom_col()), in descending order. Region ‘Sub-Saharan Africa’ has the most observations with missing HIV data. This is followed by Europe & Central Asia with the 2nd most observations with missing HIV data.#Tidying hiv dataframe
tidy_hiv <- hiv %>%
pivot_longer(cols=2:34, names_to="year", values_to = "Percentage_HIV_Cases_Age_15_49")
#Left joining tidy_hiv and countries dataframes
joined_hiv_countries <- left_join(tidy_hiv, countries, "country"= "country")
joined_hiv_countries## # A tibble: 5,082 × 20
## country year Percentage_HIV_… iso3c iso2c capital_city longitude latitude
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1979 NA AFG AF Kabul 69.2 34.5
## 2 Afghanistan 1980 NA AFG AF Kabul 69.2 34.5
## 3 Afghanistan 1981 NA AFG AF Kabul 69.2 34.5
## 4 Afghanistan 1982 NA AFG AF Kabul 69.2 34.5
## 5 Afghanistan 1983 NA AFG AF Kabul 69.2 34.5
## 6 Afghanistan 1984 NA AFG AF Kabul 69.2 34.5
## 7 Afghanistan 1985 NA AFG AF Kabul 69.2 34.5
## 8 Afghanistan 1986 NA AFG AF Kabul 69.2 34.5
## 9 Afghanistan 1987 NA AFG AF Kabul 69.2 34.5
## 10 Afghanistan 1988 NA AFG AF Kabul 69.2 34.5
## # … with 5,072 more rows, and 12 more variables: region_iso3c <chr>,
## # region_iso2c <chr>, region <chr>, admin_region_iso3c <chr>,
## # admin_region_iso2c <chr>, admin_region <chr>, income_level_iso3c <chr>,
## # income_level_iso2c <chr>, income_level <chr>, lending_type_iso3c <chr>,
## # lending_type_iso2c <chr>, lending_type <chr>
# Determining NA values in joined_hiv_countries dataframe
joined_hiv_countries %>%
filter(!is.na(region)) %>%
group_by(region) %>%
summarise(missing_hiv_values=sum(is.na(Percentage_HIV_Cases_Age_15_49))) %>%
mutate(
region=fct_reorder(region,-missing_hiv_values)) %>%
# Plotting Bar Chart of Region Specific Missing HIV Data in Descending Order
ggplot(aes(x=region,y=missing_hiv_values))+
geom_col()+
labs(title="Bar Chart of Region Specific Missing HIV Data in Descending Order",
x= "Region",
y= "Missing NA values in HIV data"
) 1. How has mortality rate for under 5 changed by region? In each region, find the top 5 countries that have seen the greatest improvement, as well as those 5 countries where mortality rates have had the least improvement or even deterioration.
#Minimum Year and Maximum Year determination
mortality <- join_3 %>%
group_by(country) %>%
summarise(minimum_year=min(year), maximum_year=max(year))
mortality## # A tibble: 92 × 3
## country minimum_year maximum_year
## <chr> <dbl> <dbl>
## 1 Algeria 1990 2008
## 2 Angola 1998 2011
## 3 Argentina 1991 2011
## 4 Armenia 2002 2011
## 5 Azerbaijan 1991 2011
## 6 Bangladesh 1990 2010
## 7 Belarus 1995 2011
## 8 Belize 1999 2011
## 9 Benin 1984 2011
## 10 Bhutan 1998 2011
## # … with 82 more rows
# Dataframe with original mortality rates
join_5 <- left_join(join_3, mortality, "country"="country") %>% #Left Joining join_3 and mortality dataframes
select(country, year, minimum_year, maximum_year, SH.DYN.MORT, region) %>% # Selecting required columns
mutate(
original_mortality = ifelse(year == minimum_year, SH.DYN.MORT, 0)) %>% #Determining original mortality rates
select(!year) %>%
filter(!original_mortality == 0)%>%
select(!SH.DYN.MORT)
join_5## # A tibble: 92 × 5
## country minimum_year maximum_year region original_mortal…
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Algeria 1990 2008 Middle East & North Africa 49.5
## 2 Angola 1998 2011 Sub-Saharan Africa 214.
## 3 Argentina 1991 2011 Latin America & Caribbean 28.1
## 4 Armenia 2002 2011 Europe & Central Asia 27.8
## 5 Azerbaijan 1991 2011 Europe & Central Asia 95.3
## 6 Bangladesh 1990 2010 South Asia 144.
## 7 Belarus 1995 2011 Europe & Central Asia 15.6
## 8 Belize 1999 2011 Latin America & Caribbean 24.4
## 9 Benin 1984 2011 Sub-Saharan Africa 199.
## 10 Bhutan 1998 2011 South Asia 86.1
## # … with 82 more rows
# Dataframe with final mortality rates
join_6 <- left_join(join_3, mortality, "country"="country") %>% #Left Joining join_3 and mortality dataframes
select(country, year, minimum_year, maximum_year, SH.DYN.MORT, region) %>% # Selecting required columns
mutate(
final_mortality = ifelse(year == maximum_year, SH.DYN.MORT, 0)) %>% #Determining final mortality rates
select(!year) %>%
filter(!final_mortality == 0)%>%
select(!SH.DYN.MORT)
join_6## # A tibble: 92 × 5
## country minimum_year maximum_year region final_mortality
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Algeria 1990 2008 Middle East & North Africa 29.5
## 2 Angola 1998 2011 Sub-Saharan Africa 112.
## 3 Argentina 1991 2011 Latin America & Caribbean 13.9
## 4 Armenia 2002 2011 Europe & Central Asia 17.6
## 5 Azerbaijan 1991 2011 Europe & Central Asia 35
## 6 Bangladesh 1990 2010 South Asia 48.7
## 7 Belarus 1995 2011 Europe & Central Asia 5.1
## 8 Belize 1999 2011 Latin America & Caribbean 18.3
## 9 Benin 1984 2011 Sub-Saharan Africa 109.
## 10 Bhutan 1998 2011 South Asia 39.9
## # … with 82 more rows
# Joining aforementioned join_5 and join_6 dataframes along with Improvement in Mortality Rates calculation
join_7 <- left_join(join_5, join_6, "country"="country") %>% #Left Joining join_5 and join_6
mutate(
mortality_improvement = (final_mortality - original_mortality)/original_mortality) %>%
arrange(desc(mortality_improvement))
join_7## # A tibble: 92 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 South Af… 1991 2005 Sub-Sah… 56 79.1
## 2 Gabon 1997 1997 Sub-Sah… 87.2 87.2
## 3 Haiti 1997 1997 Latin A… 115. 115.
## 4 South Su… 2011 2011 Sub-Sah… 102. 102.
## 5 Sudan 2011 2011 Sub-Sah… 73.4 73.4
## 6 Lesotho 1984 2011 Sub-Sah… 101. 96.1
## 7 Zimbabwe 1998 2003 Sub-Sah… 95.6 90.6
## 8 Sao Tome… 2009 2010 Sub-Sah… 47.9 45
## 9 Uzbekist… 2007 2008 Europe … 41.3 38.5
## 10 Ethiopia 2009 2011 Sub-Sah… 86.7 77.6
## # … with 82 more rows, and 1 more variable: mortality_improvement <dbl>
# Determining Change in Mortality Rate for Under 5 in each region
mortality_improvement_by_region <- join_7 %>%
group_by(region) %>%
summarise(mean_mortality_improvement_rate = 100 * mean(mortality_improvement))
mortality_improvement_by_region## # A tibble: 6 × 2
## region mean_mortality_improvement_rate
## <chr> <dbl>
## 1 East Asia & Pacific -40.0
## 2 Europe & Central Asia -48.1
## 3 Latin America & Caribbean -48.9
## 4 Middle East & North Africa -56.5
## 5 South Asia -48.0
## 6 Sub-Saharan Africa -32.0
# Plot for Region-Specific Change in Mortality Rate for Under 5
ggplot(mortality_improvement_by_region, aes(x = mean_mortality_improvement_rate, y = fct_reorder(region, -mean_mortality_improvement_rate))) +
geom_col()+
labs(
title= "Region-Specific Change in Mortality Rate for Under 5",
x= "Change in Mortality Rate for Under 5",
y="Region"
)#Sub-Saharan Africa
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
filter(region == "Sub-Saharan Africa") %>%
slice_max(order_by = mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 South Africa 1991 2005 Sub-S… 56 79.1
## 2 Gabon 1997 1997 Sub-S… 87.2 87.2
## 3 South Sudan 2011 2011 Sub-S… 102. 102.
## 4 Sudan 2011 2011 Sub-S… 73.4 73.4
## 5 Lesotho 1984 2011 Sub-S… 101. 96.1
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
filter(region == "Sub-Saharan Africa") %>%
slice_max(order_by = -mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Senegal 1981 2011 Sub-Sah… 198. 63
## 2 Niger 1990 2011 Sub-Sah… 329. 115.
## 3 Eritrea 1992 2011 Sub-Sah… 138. 53.2
## 4 Mozambique 1990 2011 Sub-Sah… 243. 100.
## 5 Tanzania 1990 2010 Sub-Sah… 165. 71.9
## # … with 1 more variable: mortality_improvement <dbl>
#South Asia
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
filter(region == "South Asia") %>%
slice_max(order_by = mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Pakistan 2002 2011 South Asia 102. 85
## 2 Sri Lanka 2001 2011 South Asia 16 11.2
## 3 Nepal 1999 2011 South Asia 85.8 44.2
## 4 India 1990 2009 South Asia 126. 61.4
## 5 Bhutan 1998 2011 South Asia 86.1 39.9
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
filter(region == "South Asia") %>%
slice_max(order_by = -mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Maldives 1997 2008 South Asia 52.8 16
## 2 Bangladesh 1990 2010 South Asia 144. 48.7
## 3 Bhutan 1998 2011 South Asia 86.1 39.9
## 4 India 1990 2009 South Asia 126. 61.4
## 5 Nepal 1999 2011 South Asia 85.8 44.2
## # … with 1 more variable: mortality_improvement <dbl>
#Latin America & Caribbean
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
filter(region == "Latin America & Caribbean") %>%
slice_max(order_by = mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Haiti 1997 1997 Latin… 115. 115.
## 2 Suriname 2005 2011 Latin… 26.5 22.5
## 3 Dominican Republic 1999 2011 Latin… 42 33.7
## 4 Belize 1999 2011 Latin… 24.4 18.3
## 5 Costa Rica 1990 2011 Latin… 16.9 10.2
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
filter(region == "Latin America & Caribbean") %>%
slice_max(order_by = -mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Ecuador 1979 2011 Latin Am… 96.1 17.5
## 2 Honduras 1979 2008 Latin Am… 99.3 25.5
## 3 Peru 1993 2011 Latin Am… 67.1 18.9
## 4 Nicaragua 1986 2010 Latin Am… 78.7 23.9
## 5 Colombia 1984 2011 Latin Am… 45.1 17.8
## # … with 1 more variable: mortality_improvement <dbl>
#Europe & Central Asia
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
filter(region == "Europe & Central Asia") %>%
slice_max(order_by = mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Uzbekistan 2007 2008 Europe … 41.3 38.5
## 2 Serbia 2005 2011 Europe … 8.9 7.4
## 3 Ukraine 2002 2011 Europe … 16.7 11.2
## 4 Armenia 2002 2011 Europe … 27.8 17.6
## 5 Bulgaria 1996 2011 Europe … 19.2 10.4
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
filter(region == "Europe & Central Asia") %>%
slice_max(order_by = -mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Turkey 1990 2011 Europe … 73.9 17
## 2 Georgia 1995 2011 Europe … 45.2 13
## 3 Belarus 1995 2011 Europe … 15.6 5.1
## 4 Azerbaijan 1991 2011 Europe … 95.3 35
## 5 Kazakhstan 2000 2011 Europe … 42.2 18.3
## # … with 1 more variable: mortality_improvement <dbl>
#Middle East & North Africa
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
filter(region == "Middle East & North Africa") %>%
slice_max(order_by = mortality_improvement, n = 5)## # A tibble: 3 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Algeria 1990 2008 Middle Eas… 49.5 29.5
## 2 Morocco 1990 2011 Middle Eas… 79.1 30.4
## 3 Tunisia 1990 2011 Middle Eas… 55.3 18
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
filter(region == "Middle East & North Africa") %>%
slice_max(order_by = -mortality_improvement, n = 5)## # A tibble: 3 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Tunisia 1990 2011 Middle Eas… 55.3 18
## 2 Morocco 1990 2011 Middle Eas… 79.1 30.4
## 3 Algeria 1990 2008 Middle Eas… 49.5 29.5
## # … with 1 more variable: mortality_improvement <dbl>
#East Asia & Pacific
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
filter(region == "East Asia & Pacific") %>%
slice_max(order_by = mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Fiji 1992 2011 East A… 26.9 23.6
## 2 Thailand 2006 2009 East A… 16.4 14.2
## 3 Myanmar 2000 2010 East A… 89 63.4
## 4 Vietnam 1998 2011 East A… 32.8 22.6
## 5 Philippines 1992 2009 East A… 50.4 32.1
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
filter(region == "East Asia & Pacific") %>%
slice_max(order_by = -mortality_improvement, n = 5)## # A tibble: 5 × 7
## country minimum_year maximum_year region original_mortal… final_mortality
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Mongolia 1995 2011 East A… 88.1 27.9
## 2 Cambodia 1997 2011 East A… 120. 40.6
## 3 Indonesia 1990 2011 East A… 84 32.6
## 4 Malaysia 1994 2011 East A… 13.9 8
## 5 Philippines 1992 2009 East A… 50.4 32.1
## # … with 1 more variable: mortality_improvement <dbl>
ggplot(join_3, aes(x = SE.PRM.NENR, y = SP.DYN.TFRT.IN)) +
geom_point() +
geom_smooth(method="lm") +
labs(title = "Relationship between Primary School Enrollment and Fertility Rate",
x = "Primary School Enrollment",
y = "Fertility Rate")Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-08-23T14%3A32%3A29/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20210912%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210912T211134Z&X-Amz-Expires=300&X-Amz-Signature=213cc5c78a7babb89325fd0e572fabaedd5ec17b0ab3b39c1cccc719b6f71386&X-Amz-SignedHeaders=host]
## Date: 2021-09-12 21:11
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 173 kB
## <ON DISK> /var/folders/7g/r4r0bq0n6v7d8lh2hqs28_mm0000gn/T//RtmpOPecfM/file1641a572a6ba5.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and week
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = TRUE),
week = isoweek(day))We can easily create a facet grid that plots bikes hired by month and year.
Look at May and Jun and compare 2020 with the previous years. What’s happening?
According to the above graph, most of the months are of a right skewed distribution while May and June are close to a normal distribution. Being right skewed means that the average bike rental value is smaller than that of which distribution is normal. During summer, it is both predicted that bike rentals go up, and this explains the observation for May and June in this graph. It can also be seen that for the lower hire counts, their frequencies are very low. This indicates that bike rental was so popular that the very few low hire counts were observed.
For year 2020, the pandemic obviously had heavily struck the bike rental industry. Overall, the frequency of renting a bike has decreased significantly compared to the other years. However, it is interesting to find that for the lower hire counts in May and June, their frequency increased in 2020 with regard to previous years. As it gets less popular for people to go out and rent bikes, the lower hire counts would appear more frequently than before. In fact, all the plots for 2020 are more right skewed than they were in the other years, as the data set shifted to lower hire counts due to the pandemic.
However, the challenge I want you to work on is to reproduce the following two graphs.
The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).
Should you use the mean or the median to calculate your expected rentals? Why?
In creating your plots, you may find these links useful:
First, we take a glimpse of the dataset.
glimpse(bike)## Rows: 4,020
## Columns: 5
## $ day <dttm> 2010-07-30, 2010-07-31, 2010-08-01, 2010-08-02, 2010-08-0…
## $ bikes_hired <dbl> 6897, 5564, 4303, 6642, 7966, 7893, 8724, 9797, 6631, 7864…
## $ year <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010…
## $ month <ord> Jul, Jul, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug…
## $ week <dbl> 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32…
Then we look at the data if there’s missing values.
skim(bike)| Name | bike |
| Number of rows | 4020 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 3 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| month | 0 | 1 | TRUE | 12 | Jul: 343, Jan: 341, Mar: 341, May: 341 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bikes_hired | 0 | 1 | 26080.7 | 9673.5 | 2764 | 19138 | 25884 | 32986 | 73094 | ▃▇▅▁▁ |
| year | 0 | 1 | 2015.6 | 3.2 | 2010 | 2013 | 2016 | 2018 | 2021 | ▇▆▆▆▇ |
| week | 0 | 1 | 26.6 | 15.1 | 1 | 14 | 27 | 40 | 53 | ▇▇▇▇▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| day | 0 | 1 | 2010-07-30 | 2021-07-31 | 2016-01-29 12:00:00 | 4020 |
#We can see that there are no missing values. #Prepare the dataframe for the first graph (monthly bike rental)
bike$day = as.Date(bike$day)
#filter the data for 2016-2019
history_bike <- bike %>%
filter(between(day, as.Date("2016-01-01"), as.Date("2019-12-30")))
#calculate the overall monthly average between 2016-2019 as benchmark
expected_bike_pcm <- history_bike %>%
group_by(month) %>%
summarise(expected_rental = mean(bikes_hired))
#calculate the excess rentals and percentage for 2016-2021
#excess_rentals = actual_rentals - expected_rentals
actual_bike_pcm <- bike %>%
filter(between(day, as.Date("2016-01-01"), as.Date("2021-12-30"))) %>%
group_by(year,month) %>%
summarise(actual_rental = mean(bikes_hired)) %>%
left_join(expected_bike_pcm, by = "month") %>%
mutate(excess_rentals = actual_rental - expected_rental)#Prepare the dataframe for the second graph (weekly bike rental)
#calculate the overall weekly average between 2016-2019 as benchmark
expected_bike_pw <- history_bike %>%
group_by(week) %>%
summarise(expected_rentals = mean(bikes_hired))
#calculate the excess rentals and percentage for 2016-2021
actual_bike_pw <- bike %>%
filter(between(day, as.Date("2016-01-01"), as.Date("2021-12-30"))) %>%
group_by(year,week) %>%
summarise(actual_rental = mean(bikes_hired)) %>%
left_join(expected_bike_pw, by = "week") %>%
mutate(excess_rentals = actual_rental - expected_rentals,
excess_rentals_pct = excess_rentals/expected_rentals) month_plot <- actual_bike_pcm %>%
ggplot(aes(x = month)) +
#Plot the actual and expected lines
geom_line(aes(y = actual_rental, group = 1), color = "black", size = 0.2)+
geom_line(aes(y = expected_rental, group = 1), color = "blue", size = 0.6)+
#Colourise the plots using ifelse function
geom_ribbon(aes(group = 1, ymin = ifelse(actual_rental <= expected_rental, actual_rental, expected_rental),ymax = expected_rental),
fill = "palevioletred3", alpha = 0.4)+
geom_ribbon(aes(group = 1, ymin = ifelse(actual_rental > expected_rental, expected_rental, actual_rental),ymax = actual_rental),
fill = "green", alpha = 0.4)+
facet_wrap(~year)+
theme_minimal()+
theme(axis.text.x = element_text(size = 5), axis.title.y = element_text(size=9), plot.title = element_text(size=9),
plot.subtitle = element_text(size=9), plot.caption = element_text(size=5))+
labs(title = "Monthly changes in TfL bike rentals", subtitle = "Change from monthly average shown in blue
and calculated between 2016-2019",y= "Bike rentals", x = "", caption = "Source: TfL, London Data Store")+
NULL
month_plotactual_bike_pw <- actual_bike_pw %>%
filter(!(year=="2021" & week > 27))
x_axis_color = ifelse(actual_bike_pw$excess_rentals_pct > 0 , "green", "red")
week_plot <- actual_bike_pw %>%
ggplot(aes(x = week)) +
#Plot the excess rental line
geom_line(aes(group = 1, y = excess_rentals_pct), color = "black", size = 0.3)+
#Fill the area between the x-axis and the line
geom_ribbon(aes(group = 1, ymin = ifelse(excess_rentals_pct >0, 0, excess_rentals_pct), ymax = excess_rentals_pct),
fill = "green", alpha = 0.3)+
geom_ribbon(aes(group = 1, ymin = ifelse(excess_rentals_pct <=0, excess_rentals_pct, 0), ymax = 0),
fill = "palevioletred3", alpha = 0.3)+
#Plot the coloured ticks on x-axis
geom_rug(color = ifelse(actual_bike_pw$excess_rentals_pct > 0 , "green", "palevioletred3"), alpha = 0.9, size = 0.3) +
#Turn y axis into percentage scale and format the major ticks on x-axis according to the sample plot
scale_y_continuous(labels = scales::percent, limits = c(-0.6,1.1))+
scale_x_continuous(breaks = c(13,26,39,53), limits = c(0,53))+
facet_wrap(~year)+
theme_minimal()+
#Plot the gray rectangles in the grid
annotate("rect",xmin = 13, xmax = 26, ymin = -Inf, ymax = Inf, fill = "grey", alpha = 0.3) +
annotate("rect",xmin = 39, xmax = 53, ymin = -Inf, ymax = Inf, fill = "grey", alpha = 0.3) +
#Labels
labs(title = "Weekly changes in TfL bike rentals", subtitle = "% change from weekly averages
calculated between 2016-2019",
y= "", x = "week", caption = "Source: TfL, London Data Store")+
theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size=9), plot.title = element_text(size=9),
plot.subtitle = element_text(size=9), plot.caption = element_text(size=5))+
NULL
week_plot # Deliverables
As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else? * Yes